Abstract

Many scientific applications are I/O intensive and gen- 
erate large data sets, spanning hundreds or thousands of 
"files." Management, storage, efficient access, and analysis 
of this data present an extremely challenging task. We have 
developed a software system, called Scientific Data Man-
ager (SDM), that uses a combination of parallel file I/O and
database support for high-performance scientific data man- 
agement. SDM provides a high-level API to the user and, in- 
ternally, uses a parallel file system to store real data and a 
database to store application-related metadata. In this pa- 
per, we describe how we designed and implemented SDM to 
support irregular applications. SDM can efficiently handle 
the reading and writing of data in an irregular mesh, as well 
as the distribution of index values. We describe the SDM 
user interface and how we have implemented it to achieve 
high performance. SDM makes extensive use of MPI-IO's 
noncontiguous collective I/O functions. SDM also uses the 
concept of a history file to optimize the cost of the index dis-
tribution using the metadata stored in the database. We present
performance results with two irregular applications, a CFD 
code called FUN3D and a Rayleigh-Taylor instability code, 
on the SGI Origin2000 at Argonne National Laboratory.



1. Introduction 

Many large-scale scientific applications are I/O intensive 
and generate large amounts of data (on the order of several 
hundred gigabytes to terabytes) [||, ^5|]. Many of these ap- 
plications perform their computation and I/O on an irreg- 
ularly discretized mesh. The data accesses in those appli-
cations make extensive use of arrays, called indirection ar- 
ray ^] or map array |jT^, in which each value of the 
array denotes the corresponding data position in memory or 
in the file. 

The data distribution in irregular applications can be 
done either by using compiler directives with the support 
of runtime preprocessing [ pi] , or by using a runtime
library. Most of the previous work in the area of
unstructured-grid applications focuses mainly on computa-
tion and communication in such applications, not on I/O.

We have developed a software system for large-scale sci- 
entific data management, called Scientific Data Manager 
(SDM) [^, that combines the good features of both file I/O 
and databases. SDM provides a high-level, user-friendly in- 
terface. Internally, SDM interacts with a database to store 
application-related metadata and uses MPI-IO to store the 
real data on a high-performance parallel file system. SDM 
takes advantage of various I/O optimizations available in 
MPI-IO, such as collective I/O and noncontiguous requests, 
in a manner that is transparent to the user. As a result, users 
can access data with the performance of parallel file I/O, 
without having to bother with the details of file I/O. 



In a previous paper [23], we described the use of SDM
for regular applications. In this paper, we describe the API, 
design, and implementation of SDM for irregular applica- 
tions. SDM can efficiently handle the reading and writing of 
data in an irregular mesh, as well as the distribution of index 
values. SDM also uses the concept of a history file to opti- 
mize the cost of the index distribution using the metadata 
stored in the database. We present performance results with
two irregular applications, a CFD code called FUN3D and 
a Rayleigh-Taylor instability code, on the SGI Origin2000 
at Argonne National Laboratory. 

The rest of this paper is organized as follows. In Sec-
tion 2 we discuss our goals in developing SDM for irreg-
ular problems. In Section 3 we present a typical irregular
problem and describe the detailed implementation issues of
SDM to solve the problem. Performance results on the SGI
Origin2000 at Argonne National Laboratory are presented
in Section 4. We discuss related work in Section 5 and con-
clude in Section 6.

2. Design Objectives 

Our main objectives in designing SDM for irregular ap- 
plications were to achieve high-performance parallel I/O, to 
provide a convenient high-level API, and to optimize the 
execution cost of irregular applications. 

• High-Performance I/O. To achieve high-performance 
I/O, we decided to use a parallel file-I/O system to 
store real data and use MPI-IO to access this data. 
MPI-IO, the I/O interface defined as part of the MPI-2 
standard [ p^ p^ , is rapidly emerging as the standard, 
portable API for I/O in parallel applications. MPI-IO 
is specifically designed to enable the optimizations that 
are critical for high-performance parallel I/O. Exam- 
ples of these optimizations include collective I/O, the 
ability to access noncontiguous data sets, and the abil- 
ity to pass hints to the implementation about access 
patterns, file-striping parameters, and so forth. 

• High-Level API. Our goal was to provide a high- 
level unified API for any kind of application (regular 
or irregular) while encapsulating the details of either 
MPI-IO or databases. With SDM, the user can specify
the data with a high-level description, together with 
annotations, and use a similar API for data retrieval. 
SDM internally translates the user's request into ap- 
propriate MPI-IO calls, including creating MPI de- 
rived datatypes for noncontiguous data p2|]. SDM also 
interacts with the database when necessary, by using 
embedded SQL functions. 

• Optimization for Irregular Applications. In irregular
applications, an index distribution is usually expensive
in terms of both communication and computation. In SDM,
after the index values are partitioned among processes,
the local index subsets of all processes are asynchronously
written to a history file, and the associated metadata is
stored in the database. When the same index distribution
is needed in subsequent runs, the index values are read
from the history file by using the metadata stored in the
database, so the user avoids repeating the communication
and computation for the same index distribution.

3. Implementation 

We discuss the SDM API for solving a sample irregular 
problem and show how the API is implemented. 



3.1. An Irregular Problem and SDM API 



/* Assume that a partitioning vector already resides in
   processor memory */

/* edge1, edge2, x, and y are read from a file "uns3d.msh" */

Int *edge1, *edge2
Double *x, *y, *p, *q

Read edge1, edge2
Distribute edge1, edge2 by using the partitioning vector
Read x
Distribute x by using the partitioned edges
Read y
Distribute y by using the partitioned nodes

For (t=1; t<maxStep; t++) {
   Compute results p and q by using the partitioned
      x, y, edge1, and edge2
   (For each checkpoint) {
      Write results p and q to files ordered by
         global node numbers
   }
}

Figure 1. A sample irregular problem and its
solution

Figure 1 shows a typical irregular problem that sweeps
over the edges of an irregular mesh. In this problem, edge1
and edge2 are two arrays representing the nodes connected by
an edge, and arrays x and y are the actual data associated
with each edge and node, respectively. The partitioned ar-
rays edge1, edge2, x, and y contain a single level of
"ghost data" beyond the boundaries to minimize remote ac-
cesses. After the computation is completed, the results p
and q are written to a file in the order of global node num-
bers.

Figures 2 and 3 show, respectively, the SDM API for writ-
ing the results p and q and for partitioning edge1, edge2,
x, and y among processes to solve the problem described in
Figure 1. We use the term import to distinguish it from a
read operation. A read operation reads the data created in
SDM, whereas an import operation reads the data created
outside of SDM.

3.2. Implementation Details 

The partitioning vector is the one generated from a par-
titioning tool, such as MeTis [[l5[ |6[. Each value of the
vector denotes the rank of the processor to which the node
is assigned. In SDM, the partitioning vector should be repli-
cated among processes. Next, the map array is the one that
specifies the mapping of each element of the local array to
the global array. This map array is created in SDM after
partitioning the indexes using a partitioning vector, or the
map array can be specified by the user.

SDM_initialize(nameOfApplication);
result = SDM_make_datalist(2, {p, q});
result[0].datatype = DOUBLE;
SDM_associate_attributes(2, &result[0]);
handle = SDM_set_attributes(2, result);

/* Partition edge1, edge2, x, and y among processes
   (Figure 3) */

SDM_data_view(handle, 2, p, &vector, &localNodes);
For (t=1; t < maxStep; t++) {
   Do computation and produce results p and q;
   For (each checkpoint) {
      SDM_write(handle, p, t, pBuf);
      SDM_write(handle, q, t, qBuf);
   }
}
SDM_finalize(handle, 2);

Figure 2. SDM API for writing results

Figure 2 shows the steps involved in initializing SDM
to solve the problem in Figure 1. Running the problem
on SDM begins with a call to SDM_initialize, which establishes
a database connection (for storing metadata). Six database
tables, run_table, access_pattern_table, execution_table,
import_table, index_table, and index_history_table, are created
to store the metadata associated with the application. Since
the two data sets, p and q, are produced as a result of compu-
tations and have the same data type and global size,
they are grouped in a data group to experiment with
different ways of organizing data in files. All the metadata
associated with these data sets is stored in the database by
SDM_set_attributes.

Figure 3 describes the steps in SDM to partition the in-
dexes and data. The four arrays, edge1, edge2, x, and y,
are imported by creating a data group. Since these arrays
have been created outside of SDM, the user has no con-
trol over the arrays except to read them, by specifying their
data type, appropriate file offset, and length. The user need
not create several data groups to import the arrays. In
SDM_make_importlist, the metadata of this imported data
group, including a mechanism for the import (partition), is
stored in the import_table for later use.

import = SDM_make_datalist(4, {edge1, edge2, x, y});
import[2].datatype = DOUBLE;
SDM_associate_attributes(2, &import[2]);
SDM_make_importlist(handle, 4, import);

SDM_import(handle, edge1, 0, totalEdges, tmp);
SDM_import(handle, edge2, (totalEdges*sizeof(int)),
           totalEdges, tmp+(totalEdges*sizeof(int)));

/* Distribute edge1 and edge2 among processes */
vector = SDM_partition_table(handle,
           partitioning_vector, totalNodes);
partitioned_edge = SDM_partition_index(handle,
           partitioning_vector, totalNodes, &tmp, &vector);

localEdges = SDM_partition_index_size(handle);
localNodes = SDM_partition_data_size(handle);

/* Make a history of this index distribution */
SDM_index_registry(handle, partitioned_edge, vector);

/* Import x */
file_offset = 2*totalEdges*sizeof(int);
SDM_data_view(handle, 1, x, &partitioned_edge,
           &localEdges);
SDM_import(handle, x, file_offset, totalEdges, xBuf);

/* Import y */
file_offset += totalEdges*sizeof(double);
SDM_data_view(handle, 1, y, &vector, &localNodes);
SDM_import(handle, y, file_offset, totalNodes, yBuf);

SDM_release_importlist(handle, 4);

Figure 3. SDM API for partitioning indexes
and data

In order to partition edge1 and edge2, SDM_import is
called to import the arrays with the parameters of file handle,
their position in the data group, file offset, file length, and
user buffer to hold the data. SDM_import first accesses the
index_table in the database to see whether a history file
exists for this problem size. If so, the metadata, such as each
process's partitioned index size and the history file name, is
retrieved from the index_table and index_history_table, and
control exits SDM_import. Otherwise, the desired data is
imported to the application. Since edge1 and edge2 are
imported contiguously, there is no need to specify a data
mapping between the file and processor memory. In
SDM_import, the total domain (file length) is divided equally
among processes, and the data in the domain is contiguously
imported into the application. In our example, edges 0 and 1
are imported to process 0, and edges 2 and 3 are imported to
process 1.
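The equal division of the file domain described above can be illustrated with a short C sketch. This is not taken from the SDM source: the name block_partition and the struct block_range are hypothetical, and the handling of a remainder (giving the first few ranks one extra element) is an assumption, since the text does not say what SDM does when the domain is not evenly divisible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the SDM API): contiguous share of a
 * domain of `total` elements for process `rank` out of `nprocs`. */
typedef struct {
    size_t start;  /* index of the first element owned by this rank */
    size_t count;  /* number of elements owned by this rank */
} block_range;

block_range block_partition(size_t total, int nprocs, int rank)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    size_t base  = total / (size_t)nprocs;  /* minimum share per process */
    size_t extra = total % (size_t)nprocs;  /* leftover elements */
    size_t r = (size_t)rank;
    block_range out;
    /* Assumption: the first `extra` ranks take one additional element. */
    out.start = r * base + (r < extra ? r : extra);
    out.count = base + (r < extra ? 1 : 0);
    return out;
}
```

With four edges and two processes, this yields edges 0 and 1 for process 0 and edges 2 and 3 for process 1, matching the example in the text.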

In SDM_partition_table, the global partitioning vec-
tor, partitioning_vector in Figure 3, is converted to
the local vector, vector in Figure 3, to determine which
node should be assigned to which process. In the example,
nodes 0 and 3 are assigned to process 0, and nodes 1, 2,
and 4 are assigned to process 1.
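As a rough illustration of this conversion, the following C sketch extracts the global node numbers owned by one process from a replicated partitioning vector. The function owned_nodes is hypothetical and is not the SDM implementation; in particular, it counts the nodes in a first pass for simplicity, whereas SDM is described as growing its buffers dynamically.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: collect the global node ids owned by `rank`.
 * Returns a malloc'ed array (caller frees) and stores its length in
 * *count. Returns NULL with *count == 0 if allocation fails. */
int *owned_nodes(const int *part_vec, int total_nodes, int rank, int *count)
{
    assert(part_vec != NULL && count != NULL && total_nodes >= 0);
    int n = 0;
    for (int i = 0; i < total_nodes; i++)   /* first pass: count */
        if (part_vec[i] == rank) n++;
    int *local = malloc((size_t)(n > 0 ? n : 1) * sizeof *local);
    if (local == NULL) { *count = 0; return NULL; }
    int k = 0;
    for (int i = 0; i < total_nodes; i++)   /* second pass: fill */
        if (part_vec[i] == rank) local[k++] = i;
    *count = n;
    return local;
}
```

For the five-node example, a partitioning vector of {0, 1, 1, 0, 1} gives nodes 0 and 3 to process 0 and nodes 1, 2, and 4 to process 1.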

If there is a history file for this problem size,
SDM_partition_index reads the already partitioned edge1
and edge2 from the history file and converts them to the lo-
calized edges by using the partitioning vector. This avoids
the communication cost to exchange each process's edges
and the computation cost to choose the edges to be assigned.
The disadvantage of the history file is that it cannot be used
if the program is run on a different number of processes
from when the file was created, because the edges and nodes
assigned to each process change with the number of
processes. One efficient use of the history file is to create it
in advance for the various numbers of processes of interest.
As long as the user runs the application with any of those
numbers of processes, an appropriate history file can be
chosen to reduce communication and computation costs. If
there is no history file, the edges in each process are
distributed by reading all the data in parallel and
performing a ring-oriented communication.
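The reuse condition for a history file can be sketched as a lookup keyed on both the problem size and the number of processes. The C fragment below uses an in-memory array as a stand-in for the database tables; the struct history_entry, its fields, and the function find_history are hypothetical, since the actual table schema is not given in this paper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for one row of history metadata. */
typedef struct {
    long problem_size;      /* e.g., total number of edges */
    int  nprocs;            /* process count used when the file was written */
    const char *file_name;  /* history file holding the partitioned edges */
} history_entry;

/* Return the matching entry, or NULL when no usable history exists and
 * the edges must be partitioned again. */
const history_entry *find_history(const history_entry *tab, size_t n,
                                  long problem_size, int nprocs)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (tab[i].problem_size == problem_size && tab[i].nprocs == nprocs)
            return &tab[i];
    return NULL;
}
```

A run on a process count for which no entry was registered falls through to NULL, which corresponds to the ring-oriented distribution path.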

If at least one node of an edge has been partitioned to a pro-
cess, the edge is assigned to that process. For example, edge
0 is assigned to both processes 0 and 1 because one node of
the edge, node 0 (from edge1), has been partitioned to process 0
and the other node, node 1 (from edge2), has been partitioned
to process 1. This edge is stored as a ghost edge on both
processes to minimize communication volume.
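The assignment rule can be written down compactly. The C sketch below marks the edges that one process keeps, ghost edges included; select_edges is a hypothetical name, not an SDM function, and the edge endpoints used in the usage note are an assumed mesh chosen to reproduce the distribution in the example, since the paper does not list them.

```c
#include <assert.h>

/* Hypothetical helper: keep[e] is set to 1 when at least one endpoint of
 * edge e is owned by `rank` according to the partitioning vector, else 0.
 * Returns the number of edges kept (ghost edges included). */
int select_edges(const int *edge1, const int *edge2, int total_edges,
                 const int *part_vec, int rank, int *keep)
{
    assert(edge1 && edge2 && part_vec && keep);
    int kept = 0;
    for (int e = 0; e < total_edges; e++) {
        keep[e] = (part_vec[edge1[e]] == rank) ||
                  (part_vec[edge2[e]] == rank);
        kept += keep[e];
    }
    return kept;
}
```

With the assumed endpoints edge1 = {0, 1, 0, 2} and edge2 = {1, 2, 3, 4} and the partitioning vector {0, 1, 1, 0, 1}, process 0 keeps edges 0 and 2, and process 1 keeps edges 0, 1, and 3; edge 0 appears on both as a ghost edge.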

For storing the partitioned edges and nodes, including 
the ghost ones, a certain amount of memory space is ini- 
tially allocated to each process. When the entire memory 
space is occupied by the partitioned data, it is automatically 
doubled by adjusting the memory size. This prevents the 
system from looking through the entire data in two steps, 
one step to decide the size of memory space and the other 
step to actually store the data in the memory space. 
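A minimal C sketch of this doubling strategy, using the standard realloc function, is shown below. The type int_buf, the function int_buf_push, and the initial capacity of four elements are assumptions made for illustration; the paper does not state the initial size that SDM allocates.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int   *data;
    size_t len;  /* elements in use */
    size_t cap;  /* elements allocated */
} int_buf;

/* Append one value, doubling the capacity when the buffer is full.
 * Returns 0 on success, -1 if memory could not be obtained. */
int int_buf_push(int_buf *b, int value)
{
    assert(b != NULL);
    if (b->len == b->cap) {
        size_t new_cap = b->cap ? 2 * b->cap : 4;  /* assumed initial size */
        int *p = realloc(b->data, new_cap * sizeof *p);
        if (p == NULL) return -1;  /* the old buffer is still valid */
        b->data = p;
        b->cap  = new_cap;
    }
    b->data[b->len++] = value;
    return 0;
}
```

Because the capacity doubles, appending n elements triggers only about log2(n) reallocations, so the data can be stored in a single pass without first counting it.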

After the edges and nodes are distributed, the edges held
by each process are passed to the next process in a ring.
In the example, process 0 receives edges 2 and
3, and process 1 receives edges 0 and 1 to partition them
as described above. After finishing the edge distribution,
edges 0 and 2 are assigned to process 0, and edges 0, 1,
and 3 are assigned to process 1. Similarly, nodes 0, 1, and
3 are assigned to process 0, and nodes 0, 1, 2, and 4 are
assigned to process 1. In Figure 3, partitioned_edge
contains the edges assigned to each process, and vector
contains the nodes assigned to it. These are the two map
arrays used to distribute the physical data associated with each
edge and node, respectively.
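The ring pattern can be made concrete with a small index calculation. The C function below is a hypothetical illustration, not SDM code: it reports which process originally imported the block of edges that a given rank is examining after a number of ring shifts, assuming each process forwards its current block to the next higher rank (wrapping around) in every step.

```c
#include <assert.h>

/* After `step` ring shifts, the block held by `rank` is the one that was
 * originally imported by the returned process. */
int ring_block_owner(int rank, int step, int nprocs)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs && step >= 0);
    /* The extra "+ nprocs" keeps the result non-negative in C. */
    return ((rank - step) % nprocs + nprocs) % nprocs;
}
```

With two processes, one shift gives process 0 the block imported by process 1 (edges 2 and 3) and gives process 1 the block imported by process 0 (edges 0 and 1), as in the example. After nprocs - 1 shifts every process has examined every block.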

If SDM_index_registry is executed for the first time
and no history file was created earlier, the metadata of the
partitioned edges, such as the partitioned size of each pro-
cess, is stored in the database tables index_table and
index_history_table. Also, the partitioned edges are asyn-
chronously written to a history file to be retrieved in sub-
sequent runs requiring the same edge distribution. The use
of SDM_index_registry is optional. If the user does not
call SDM_index_registry, no history file is created after
partitioning the edges.

In order to import and partition data x and y in
SDM_import, SDM_data_view must be called to define
the data mapping between a noncontiguous global view of
the file and a local view of the processor memory. Using
the data mapping, SDM_import distributes the associated
data irregularly by calling a collective MPI-IO func-
tion. In SDM_release_importlist, the structures
used to import data through the file handle are freed.

Figure 2 shows the steps to write two data sets, p and q,
after completing the computations at each checkpoint. Be-
fore writing p and q, the data mapping to write is defined in
SDM_data_view using the map array (vector) associ-
ated with the node partition.

SDM supports three different ways of organizing data in 
files. In level 1, each data set generated at each time step 
is written to a separate file. This file organization is simple, 
but it incurs the cost of a file-open, file-view to define the 
visible portion of a file for each process and a file-close at 
each time step. In level 2, each data set (within a group) is 
written to a separate file, but different iterations of the same 
data set are appended to the same file. This method results
in a smaller number of files and smaller file-open and file-
view costs. The offset in the file where data is appended
is stored in the execution_table. In level 3, all iterations of
all data sets belonging to a group are stored in a single file.
As in level 2, the file offset for each data set is stored in
the execution_table by process 0 in the SDM_write function.
The idea is that if a file system has high file-open and file- 
close costs, and an application generates a high file-view 
cost, as in irregular applications, SDM can generate a very 
small number of files. However, if an application produces 
a large number of data sets with a large problem size, level 3 
file organization would result in very large files, which may 
degrade the performance. 

Figure 4 depicts the metadata storage in the database and
the organization of data in files in SDM for the example in
Figure 1.

4. Performance Results 

[Figure: two panels, "Database tables for writing and reading simulation results" and "Database tables for importing and partitioning indexes and data"]

Figure 4. SDM execution flow to solve for the
example in Figure 1

We obtained performance results on the SGI Origin2000
at Argonne National Laboratory. The Origin2000 has 128
processors and 10 Fibre Channel controllers connected to
a total of 110 disks of 9 GBytes capacity each. The file
system on the Origin2000 is SGI's XFS [10]. For the
results, we used XFS buffered I/O and MySQL [po|| to store
the metadata.

The first application template that we benchmarked was 
a tetrahedral vertex-centered unstructured grid code devel- 
oped by W. K. Anderson of the NASA Langley Research 
Center [|l]]. This application uses a partitioning vector gen- 
erated from MeTis to partition the nodes and edges in a 
mesh. To evaluate SDM ported to the application, we used 
about 18M edges and 2M nodes. At the initial stage, the 
application imports edges, four data arrays associated with 
edges, and another four data arrays associated with nodes. 
The total imported data size was about 807 MBytes. As
a result of computations, the application wrote four data
sets of about 21 MBytes each and a single data set of
105 MBytes. Using 64 processors, we iterated the application
template for two time steps; at each time step, five data sets
were written to files.

The second application template that we ran was a
Rayleigh-Taylor instability application that is motivated
by a joint project between the University of Chicago and Ar-
gonne to study thermonuclear flashes on astrophysical ob-
jects. Whenever the current time reaches a certain point,
the application writes two data sets: a single node data set
associated with vertices in a mesh, and a triangle data set
associated with triangles on tetrahedral faces. In the appli-
cation template, we wrote about 36 MBytes of the node data
set and about 74 MBytes of the triangle data set at each time
step. Since we iterated the template five times, the total data
size written was approximately 550 MBytes.

4.1. Results for FUN3D 
Figure 5. Execution time for partitioning in-
dices and data in FUN3D



Figure 5 shows the bandwidth to import and partition
18M edges, four data sets each of 144 MBytes of data as-
sociated with edges, and another four data sets each of 21
MBytes of data associated with nodes. The original version
of the application, without using SDM, performs all the
I/O operations by a single process (process 0), which then
broadcasts data to other processes. SDM performs I/O in
parallel from all processes using MPI-IO. The bar labeled
index distri. in Figure 5 shows the communication
and computation costs to partition the edges after import-
ing them to the application. Also, the bar labeled import
shows the cost of reading the edges and eight data arrays.

The original application reads the edges in two steps: one
step to determine the amount of memory to store the parti-
tioned edges and the other step to actually read the edges.
SDM, however, extends the allocated memory dynamically
as needed (using the C function realloc) and is therefore
able to read the partitioned edges in a single step. This con-
tributes to the reduced cost of index distri. when
using SDM. When partitioning the edges with a history file,
the cost of index distri. is nothing but reading the
history file of the edges in a contiguous way, including the
database cost to access the metadata. Since the history file
contains the already partitioned edges, there is no need to
import the edges; hence, the read cost in import is re-
duced.

[Figure: bar chart of write and read bandwidth for the level 1, level 2, and level 3 file organizations]

Figure 6. I/O bandwidth for reading and writ-
ing data in FUN3D

Figure 6 shows the I/O bandwidth for writing and then
reading back the data generated from the application using
64 processors. The total data size was approximately 379
MBytes. In level 1, each data array is written to a separate
file, resulting in the creation of 10 different files. Each
time a data array is written, level 1 incurs the
cost of opening a file and defining an MPI-IO file view to
access the data from the portion of the file pointed to by the
global file offset. In level 2, however, each data array gen-
erated at each time step is appended to one of five files, generat-
ing five file-open and file-view costs. This reduced number
of files improves the I/O performance slightly. In level 3,
only two files are generated, resulting in the best I/O per-
formance among the three file organizations. On the SGI
Origin2000, the difference among the three file organizations
is not significant because the file-open cost is small.
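The file counts quoted above follow directly from the definitions of the three levels, as the following C sketch shows. The function files_created is hypothetical and only restates the text. The use of two data groups for FUN3D is an inference from the two files reported for level 3 (a group requires data sets of the same type and global size, which would separate the four 21-MByte data sets from the 105-MByte one); the paper does not state the grouping explicitly.

```c
#include <assert.h>

/* Number of files produced by each SDM file-organization level,
 * following the description in the text. */
int files_created(int level, int groups, int datasets, int timesteps)
{
    assert(level >= 1 && level <= 3);
    assert(groups > 0 && datasets >= groups && timesteps > 0);
    switch (level) {
    case 1:  return datasets * timesteps;  /* one file per data set per step */
    case 2:  return datasets;              /* iterations appended per data set */
    default: return groups;                /* one file per data group */
    }
}
```

For five data sets written over two time steps in two groups, this gives 10, 5, and 2 files for levels 1, 2, and 3, respectively.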

4.2. Results of RT Application 

Figure 7 shows the I/O bandwidth for writing approxi-
mately 550 MBytes of data. In the original application, the
write operation is performed sequentially. In other words,
after seeking the starting position in a file, processes write
their local portion of data one by one. When we ported the
application to SDM, the I/O performance increased signifi-
cantly because of the I/O optimizations of MPI-IO.

In SDM, we wrote the node data set according to the 
global node number of the partitioned nodes, and wrote the 
triangle data set contiguously. Since two data sets are writ- 
ten to files separately, SDM supports two different ways of 
file organization: level 1 and level 2/3 (levels 2 and 3 are 
identical in this case). As can be seen in Figure 7, on the
SGI Origin2000, changing the file organization does not af- 
fect the I/O performance, since the cost of file-open and 
file-view is very low. 

When the number of processors increases while the data
size stays the same, the I/O performance degrades.
With 32 processors, the data written by each process
at each time step is about 1 MByte for the node data set
and 2 MBytes for the triangle data set. If the number of
processors goes up to 64, the buffer size of each process
becomes smaller, resulting in reduced performance.
Clearly, there is an optimal buffer size that gives the best
I/O performance.
Figure 7. I/O bandwidth for RT



5. Related Work 

Several efforts have sought to optimize I/O in parallel file 
systems and runtime libraries ^ |14[ |16[ |18|, |2[|^ |31|] . 
SRB (Storage Resource Broker) ||2| provides a uniform in-
terface to access various storage systems, such as file sys- 
tems, Unitree, HPSS and database objects. However, it does 
not fully support the optimizations implemented in MPI- 
IO. Shoshani et al. p9| ] describe an architecture for op- 
timizing access to large volumes of scientific data stored
on tapes. The Active Data Repository Jit} ] and DataCut- 
ter [Ql optimize storage, retrieval, and processing of very 
large multidimensional datasets. The main difference be- 
tween our work and other efforts in I/O is that SDM aims to 
combine the good features of parallel file I/O and databases, 
whereas other efforts focus on either parallel I/O or data 
management, not both. 

6. Summary 

We have described the SDM system, API, and imple- 
mentation for I/O in irregular applications. SDM provides 
an easy-to-use user interface for managing large data sets 
and internally uses MPI-IO for high-performance I/O and a 
database for storing metadata. We studied the performance 
of SDM using two irregular applications: FUN3D and RT. 
When we ported both applications to use SDM, there was a 
significant improvement in I/O performance compared with 
the original application. Also, we observed that using a his- 
tory file for the index distribution helped to reduce the com- 
putation and communication costs. However, changing the 
SDM file organization from level 1 to level 3 did not greatly 
affect the performance on the SGI Origin2000, because of 
its low file-open and file-view costs. 

We plan to develop SDM further to support visualiza- 
tion applications and to investigate whether SDM can ef- 
fectively be used as a strategy for implementing libraries 
such as HDF [|l|] and netCDF @. 